Prosper Marketplace, Inc. is a peer-to-peer lending company based in San Francisco, California. It’s the first peer-to-peer lending marketplace in the industry, with over $7 billion in funded loans. Borrowers request personal loans on Prosper, and investors choose which loans to fund after considering the borrower’s credit score, rating, history, and the category of the loan. Prosper services the loans, collecting payments and distributing principal and interest to the investors.
For this project, Udacity provided a sample of the loan data from Prosper (last updated on 03/11/2014). The data can be downloaded here and the variable dictionary is here.
With the data loaded, I’ll do a quick check on the types of variables in this large dataset.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2802 levels "2005-11-25 00:00:00",..: 1137 NA 1262 NA NA NA NA NA NA NA ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
## $ Occupation : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11585 levels "1947-08-24 00:00:00",..: 8638 6616 8926 2246 9497 496 8264 7684 5542 5542 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Since ProsperScore and ProsperRating..Alpha. can only take a limited number of different values, I’m converting them into ordered factor variables and arranging their levels from lowest to highest so they display better.
## Ord.factor w/ 11 levels "1"<"2"<"3"<"4"<..: NA 7 NA 9 4 10 2 4 9 11 ...
## Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: NA 6 NA 6 3 5 2 4 7 7 ...
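A minimal sketch of this kind of conversion, using only the first ten values shown in the `str()` output above. The object names (`score_f`, `rating_f`, `rating_levels`) are made up for illustration, and the level order for the rating (HR lowest, AA highest) is taken from the output line above rather than from the original code.

```r
# Toy values copied from the first rows of the str() output above.
score  <- c(NA, 7, NA, 9, 4, 10, 2, 4, 9, 11)
rating <- c(NA, "A", NA, "A", "D", "B", "E", "C", "AA", "AA")

# Ordered factor for the score: levels 1..11, lowest first.
score_f <- factor(score, levels = 1:11, ordered = TRUE)

# Ordered factor for the rating: riskiest (HR) first, safest (AA) last.
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")
rating_f <- factor(rating, levels = rating_levels, ordered = TRUE)

str(score_f)
str(rating_f)
```

Because the factors are ordered, comparisons such as `rating_f[2] > rating_f[5]` (A versus D) are meaningful, and plots and tables list the levels in rank order instead of alphabetically.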
This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
## HR E D C B A AA
## 6935 9795 14274 18345 15581 14551 5372
It seems the distribution of loan ratings is roughly bell-shaped, with “C” ratings being the most frequent and the extreme ratings (“HR” and “AA”) the least frequent.
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
It seems the distribution of Prosper Scores is similar to the distribution of Prosper Ratings. The most concentrated area is between scores 4 and 8.
The majority of the borrowers are within the $25k to $75k range. The surprising thing is that within this dataset, people in the $1-24,999 range borrowed less often than those in the higher income ranges. One would think these people need the most financial help. Perhaps Prosper did not reach this segment of people.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
It seems most people took loan amounts of under $10,000. One interesting observation is that the number of loans spikes at $5,000 intervals, as seen at $10,000, $15,000, $20,000, etc. Is it possible people lean towards amounts in multiples of $5,000? Or perhaps Prosper offers a selection list of amounts that are multiples of $5,000, and lets customers specify an amount if their desired amount is not in the list?
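One way to put a number on the spikes is to flag the amounts that are exact multiples of $5,000. The sketch below uses a hypothetical vector of ten amounts, not the Prosper data, so the resulting share is illustrative only.

```r
# Hypothetical sample of loan amounts (not taken from the Prosper data).
amount <- c(1000, 4000, 5000, 6500, 9425, 10000, 10000, 15000, 3001, 20000)

# TRUE when the amount is an exact multiple of $5,000.
is_round <- amount %% 5000 == 0

# Share of loans at round amounts, and the count at each round amount.
mean(is_round)
table(amount[is_round])
```

Applied to the full LoanOriginalAmount column, the same two lines would show how much of the histogram's mass sits on the $5,000 grid.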
## AK AL AR AZ CA CO CT DC DE FL GA HI
## 200 1679 855 1901 14717 2210 1627 382 300 6720 5008 409
## IA ID IL IN KS KY LA MA MD ME MI MN
## 186 599 5921 2078 1062 983 954 2242 2821 101 3593 2318
## MO MS MT NC ND NE NH NJ NM NV NY OH
## 2615 787 330 3084 52 674 551 3097 472 1090 6729 4197
## OK OR PA RI SC SD TN TX UT VA VT WA
## 971 1817 2972 435 1122 189 1737 6842 877 3278 207 3048
## WI WV WY NA's
## 1842 391 150 5515
Nothing too interesting here - states with large cities have more people, and therefore account for more loans than states with smaller cities.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
##
## Not Available Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby&Adoption
## 2572 10494 199
## Boat Cosmetic Procedure Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical/Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
One category stands far above the rest, and that is Debt Consolidation. This makes sense, since many people in debt are likely to hold high-interest loans from elsewhere. Getting a better rate from Prosper could save them a large amount of interest on those loans.
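The second table above pairs each numeric listing code with a readable label. A minimal sketch of that mapping, assuming the labels line up with codes 0 through 20 in the order shown (the matching counts in the two tables suggest they do); the object names are made up for illustration.

```r
# Labels in the order of the numeric codes 0..20, as shown in the table above.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

# First ten codes from the str() output of ListingCategory..numeric.
code <- c(0, 2, 0, 16, 2, 1, 1, 2, 7, 7)

# Attach the labels to the codes; unused labels remain as empty levels.
listing_category <- factor(code, levels = 0:20, labels = category_labels)

# Drop the empty levels before tabulating this small sample.
table(droplevels(listing_category))
```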
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
It seems a sizeable number of loans are funded entirely by a single investor, and the majority of loans have fewer than 100 investors. This is not too surprising, since loan amounts (as seen before) are usually less than $10,000, which a single investor can cover comfortably.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
Based on the plot, most yields are between 5% and 35%, with most of them concentrated near 17%.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
It seems the majority of all loans in the dataset are in good standing. If anything, this reflects nicely on Prosper as a platform for peer-to-peer lending, since many people, including me, had doubts about the safety of investing in non-traditional loans.
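The share of loans in good standing can be computed directly from the counts in the table above. Which statuses count as "good standing" is an assumption here (Completed, Current, and FinalPaymentInProgress); a different grouping would give a different share.

```r
# Loan status counts copied from the table above.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Assumption: these three statuses are treated as "good standing".
good <- c("Completed", "Current", "FinalPaymentInProgress")

# Fraction of all loans that fall into the good-standing group.
good_share <- sum(status_counts[good]) / sum(status_counts)
round(good_share, 3)
```

Under this grouping a little over four fifths of the loans are in good standing, with charged-off and defaulted loans making up most of the remainder.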
The Prosper Loans dataset has 113,937 observations and 83 variables (the original 81 plus the ones I created). The variables fall into 3 classes - numeric, factor, and int. The variables I explored in the Univariate Plotting section were:
* ProsperRating: Factor variable with 7 levels
* ProsperScore: Factor variable with 11 levels
* IncomeRange: Factor variable with 8 levels
* LoanOriginalAmount: Integer variable
* BorrowerState: Factor variable with 51 levels
* ListCategory: Factor variable with 21 levels
* Investors: Integer variable
* LenderYield: Numeric variable
* Status: Factor variable with 4 levels
I want to determine whether the Loan Status of a loan is connected to or affected by certain variables, such as yield or the income and ratings of the borrowers. I also want to check how the Lender Yield is affected by things like credit scores and Prosper scores.
Credit scores and loan terms can be considered as well. Since a credit score reflects how consistently borrowers pay back what they borrowed, it would make sense for investors to favor loans by borrowers with good credit scores, since borrowers with bad credit scores are more likely to default or delay payment. On the other hand, loan terms may be important to certain people, depending on whether they are looking for something long term or short term.
Yes, I introduced the ProsperRating and Status variables, and set them to be ordered factor variables.
There were some unusual distributions, such as Number of Loans by Investors and Number of Loans by Loan Amount. These two are mostly skewed to the right.
I converted a few variables to factors and ordered their levels. I did this because statistical models treat numeric and factor variables differently, and also treat unordered and ordered factors differently. To make sure the models use the correct method, I had to make sure the variables I’m investigating are of the right data type.
## LoanOriginalAmount Investors LenderYield
## LoanOriginalAmount 1.0000000 0.3800926 -0.3284551
## Investors 0.3800926 1.0000000 -0.2741739
## LenderYield -0.3284551 -0.2741739 1.0000000
A quick correlation matrix shows that the continuous variables LoanOriginalAmount, Investors, LenderYield, CreditScoreRangeAvg, and Term do not have a strong correlation with one another; among the pairs shown above, the largest in magnitude is 0.38, between LoanOriginalAmount and Investors. I do want to check out the relationship (if any) between Score and Rating with the Investors variable.
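A minimal sketch of how a derived credit score column and a correlation matrix like this can be produced. It uses only the first ten rows from the `str()` output, so the numbers will not match the matrix above, and it assumes CreditScoreRangeAvg is the midpoint of the lower and upper bounds, which the report does not state explicitly.

```r
# First ten rows of four columns, copied from the str() output above.
loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, 800, 680, 740, 680, 700, 820, 820),
  CreditScoreRangeUpper = c(659, 699, 499, 819, 699, 759, 699, 719, 839, 839),
  LoanOriginalAmount    = c(9425, 10000, 3001, 10000, 15000, 15000, 3000,
                            10000, 10000, 10000),
  Investors             = c(258, 1, 41, 158, 20, 1, 1, 1, 1, 1)
)

# Assumption: the average score is the midpoint of the reported range.
loans$CreditScoreRangeAvg <-
  (loans$CreditScoreRangeLower + loans$CreditScoreRangeUpper) / 2

# Pairwise Pearson correlations; missing values are dropped pair by pair.
cor_matrix <- cor(
  loans[, c("LoanOriginalAmount", "Investors", "CreditScoreRangeAvg")],
  use = "pairwise.complete.obs"
)
round(cor_matrix, 2)
```

The `use = "pairwise.complete.obs"` argument matters on the full dataset, where several columns contain `NA` values and the default would return `NA` for any pair involving them.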
Based on this scatterplot, it seems rating and score have a positive linear relationship; higher ratings tend to result in higher scores. I suspect that the relationships between Investors and both Rating and Score should be similar. Let’s find out.
## prosper$ProsperScore: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 16.00 35.00 40.78 58.00 293.00
## --------------------------------------------------------
## prosper$ProsperScore: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 10.0 30.3 45.0 470.0
## --------------------------------------------------------
## prosper$ProsperScore: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 9.50 35.24 52.00 483.00
## --------------------------------------------------------
## prosper$ProsperScore: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 6.00 36.88 53.00 833.00
## --------------------------------------------------------
## prosper$ProsperScore: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 26.00 51.72 74.00 1024.00
## --------------------------------------------------------
## prosper$ProsperScore: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 30.00 59.62 88.00 821.00
## --------------------------------------------------------
## prosper$ProsperScore: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 33.00 67.78 104.00 1035.00
## --------------------------------------------------------
## prosper$ProsperScore: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 8.0 70.0 105.7 162.0 1189.0
## --------------------------------------------------------
## prosper$ProsperScore: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 23.0 92.0 121.5 183.0 659.0
## --------------------------------------------------------
## prosper$ProsperScore: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 88.0 133.6 206.0 779.0
## --------------------------------------------------------
## prosper$ProsperScore: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 94.3 169.0 714.0
## # A tibble: 12 x 2
## ProsperScore n
## <ord> <int>
## 1 1 992
## 2 2 5766
## 3 3 7642
## 4 4 12595
## 5 5 9813
## 6 6 12278
## 7 7 10597
## 8 8 12053
## 9 9 6911
## 10 10 4750
## 11 11 1456
## 12 NA 29084
From a Prosper Score of 4 upward, the median number of investors for a given loan generally rises, from 6 at a score of 4 to 92 at a score of 9. The lowest scores do not follow this pattern (the median is 35 at a score of 1), and at a score of 11 the median drops to 1. One explanation for the drop at 11 may be that since the score is so high and the loan yield is almost guaranteed, individual investors buy out the entirety of the loan amount.
This plot supports my original claim that a better score/rating usually results in a higher number of investors. Now let’s examine a few other relationships.
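To check how consistent the trend is, the medians from the per-score summaries above can be compared step by step. The sketch below copies those eleven medians by hand; on the full data something like `tapply(Investors, ProsperScore, median)` would produce them, though that call is an assumption about the workflow rather than the original code.

```r
# Median investors per Prosper Score (1..11), copied from the summaries above.
median_investors <- c(35, 10, 9.5, 6, 26, 30, 33, 70, 92, 88, 1)

# Change in the median from each score to the next one.
step_change <- diff(median_investors)

# Scores whose median is lower than the median at the previous score.
scores_with_drop <- which(step_change < 0) + 1L
scores_with_drop
```

The median falls at scores 2, 3, 4, 10, and 11 and rises everywhere else, so the upward trend holds mainly across the middle of the scale.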
This graph shows a peculiar relationship: as the income level goes up, more money is borrowed per loan. One would think having a higher income would reduce the need to take out a loan for high-value assets like houses, cars, etc. On the contrary, the relationship between income and loan amount is direct, not inverse. Perhaps people of different income levels have different priorities - people with high income may treat things like starting a business or other expensive ventures as high priority, while people with low income would probably treat paying off debt as high priority. It’s also possible that people with higher income tend to buy the same things as others, but of higher quality, hence a slight increase in median loan amount. This prompts a Loan Category vs Income Range exploration.
Judging by this chart, debt consolidation is a huge piece of the puzzle for people from all socio-economic classes. Without further debt data to break down, it’s hard for me to tell why people with high income take out larger loans than people with lower income. Is it their spending habits? Or perhaps in certain areas (such as the Bay Area), a $100k salary is considered only average? Moving on.
## [1] -0.2741739
As previously observed in the correlation matrix, a higher lender yield is associated with a slight decrease in the number of investors. Perhaps higher yield usually means higher risk, and more people are risk-averse than not.
Having a high income doesn’t necessarily mean the borrower will not default. Having a high Prosper Rating and Score does seem to affect the number of defaults - the occurrences of Past Dues and Defaults seem to become very rare at very high scores/ratings (when score = 10 or 11, rating = A or AA). This, plus the original plot of “Number of Investors vs Prosper Rating”, makes a lot of sense, since investors would not appreciate seeing their investments go down the drain.
We now see a much clearer picture. Higher scores usually result in lower yield, since there is lower risk involved.
By looking at the boxplots, the picture becomes much more clear. I will discuss more about the relationships in the Bivariate Analysis section below.
Based on the last bivariate plot, when comparing loans in good standing to loans that are not, I found that: 1) Loans in good standing have a higher median loan amount, 2) Loans in good standing have a lower median number of investors, 3) Loans in good standing have a lower median yield, 4) The people requesting loans that are in good standing have a higher median credit score, and 5) There is no major difference in loan duration between loans in good standing and the delinquent ones.
Some other interesting relationships I found are:
* Median Yield goes down as Prosper Score goes up
* Number of Investors goes down as Yield goes up
* Out of all the listing categories, a huge chunk of loans belong to the “Debt Consolidation” category, and also people who are relatively well-off (salary of $75k or above) have most of their loans in this category.
* Loan amount increases as income level increases
* Median number of Investors goes up as Prosper score or rating goes up
The strongest relationship is definitely Median Lender Yield vs Prosper Score. It confirms our common knowledge that the higher the Prosper Score, the lower the yield. “High risk, high reward”, as people always say.
What I discussed in the bivariate analysis section can still be seen here: as Prosper Score increases, yield decreases for loans of all statuses. However, loans that are in good standing have slightly lower yields than the others across the board. This is probably because there are other variables affecting yield, such as current delinquent amount or credit scores.
Similar to the previous plot, yield decreases as credit score increases, for all loan statuses. Once again, loans that are in good-standing have lower yield than loans that are delinquent.
This graph brings out some new insight on credit scores and how they are a relatively good indicator of whether a loan is going to go bad. The median credit score of borrowers whose loans are either defaulted or past due does not surpass 690, whereas the median credit score of borrowers whose loans are in good standing surpasses 690, and rises further as income level increases.
##
## Calls:
## m1: lm(formula = LenderYield ~ ProsperScore, data = prosper)
## m2: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount,
## data = prosper)
## m3: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount +
## CreditScoreRangeAvg, data = prosper)
## m4: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount +
## CreditScoreRangeAvg + AmountDelinquent, data = prosper)
## m5: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount +
## CreditScoreRangeAvg + AmountDelinquent + Status, data = prosper)
##
## ======================================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------------------
## (Intercept) 0.184*** 0.213*** 0.496*** 0.495*** 0.531***
## (0.000) (0.000) (0.003) (0.003) (0.003)
## ProsperScore: .L -0.219*** -0.192*** -0.165*** -0.165*** -0.159***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: .Q -0.010*** -0.009*** -0.006*** -0.006*** -0.008***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: .C -0.003** 0.003** 0.002* 0.002* 0.007***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^4 0.026*** 0.030*** 0.021*** 0.021*** 0.016***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^5 0.005*** 0.006*** 0.003*** 0.003*** 0.005***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^6 -0.006*** -0.004*** -0.008*** -0.008*** -0.009***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^7 0.004*** 0.002** 0.002*** 0.002*** 0.002**
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^8 0.005*** 0.003*** 0.003*** 0.003*** 0.002***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## ProsperScore: ^9 -0.005*** -0.004*** -0.006*** -0.006*** -0.006***
## (0.001) (0.001) (0.000) (0.000) (0.000)
## ProsperScore: ^10 0.008*** 0.007*** 0.007*** 0.007*** 0.007***
## (0.001) (0.001) (0.000) (0.000) (0.000)
## LoanOriginalAmount -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000)
## CreditScoreRangeAvg -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000)
## AmountDelinquent 0.000*** 0.000***
## (0.000) (0.000)
## Status: Past Due/Defaulted -0.025***
## (0.001)
## Status: Current or Paid/Defaulted -0.047***
## (0.001)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.440 0.504 0.556 0.557 0.585
## adj. R-squared 0.440 0.504 0.556 0.557 0.585
## sigma 0.056 0.053 0.050 0.050 0.048
## F 6664.472 7823.693 8869.210 8191.290 7959.190
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 124404.571 129521.286 134299.039 134311.792 137080.870
## Deviance 264.690 234.618 209.631 209.568 196.326
## AIC -248785.142 -259016.573 -268570.077 -268593.584 -274127.741
## BIC -248672.958 -258895.040 -268439.196 -268453.354 -273968.813
## N 84853 84853 84853 84853 84853
## ======================================================================================================
The linear model I built is to predict the yield of a loan, with the formula (LenderYield ~ ProsperScore + LoanOriginalAmount + CreditScoreRangeAvg + AmountDelinquent + Status). As I was adding these variables (that I thought may help me explain the variability of the response data), I kept an eye on the R-squared value. With those 5 predictor variables, the model can explain about 59% of the variability of the data, which is quite decent.
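The sketch below illustrates two details of a table like the one above, using simulated data rather than the Prosper loans (the column names mirror the report, the values and coefficients are invented). First, an ordered factor enters `lm()` through polynomial contrasts, which is where coefficient names such as `.L`, `.Q`, `.C` and `^4` come from. Second, nested models are better compared on adjusted R-squared or AIC than on plain R-squared.

```r
set.seed(42)

# Simulated stand-in data; the names mirror the report, the values do not.
n <- 500
score  <- sample(1:11, n, replace = TRUE)
amount <- round(runif(n, 1000, 35000))
yield  <- 0.30 - 0.015 * score - 2e-06 * amount + rnorm(n, sd = 0.03)

sim <- data.frame(
  LenderYield        = yield,
  ProsperScore       = factor(score, levels = 1:11, ordered = TRUE),
  LoanOriginalAmount = amount
)

# Ordered factors are coded with polynomial contrasts by default,
# giving one coefficient per polynomial degree (.L, .Q, .C, ^4, ...).
m1 <- lm(LenderYield ~ ProsperScore, data = sim)
m2 <- update(m1, . ~ . + LoanOriginalAmount)

names(coef(m1))[1:4]

# Plain R-squared cannot decrease when a predictor is added,
# so adjusted R-squared and AIC are the fairer comparison.
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
AIC(m1, m2)
```

With eleven score levels the factor contributes ten coefficients, which matches the ten ProsperScore rows in the table above.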
Based on the median yield vs Prosper Score by Loan Status plot, loans with lower Prosper Scores usually result in higher yield as well. Same deal when looking at the yield vs credit score plot; loans with lower credit scores usually result in higher yield. However, when specifying a third variable - Loan Status - to both plots, we can see that loans in good-standing have lower yields than loans that are delinquent, given the same scores.
Out of curiosity, I generated a linear regression model to explain the variability of the data, with some success. The variables I chose (ProsperScore, LoanOriginalAmount, CreditScoreRangeAvg, AmountDelinquent, Status) made sense for predicting the Lender Yield of loans, and as I added them one by one to the linear model, the R-squared value increased after each addition. Since R-squared can never decrease when a predictor is added, the stronger evidence is that the adjusted R-squared also rose (from 0.440 to 0.585) and both AIC and BIC fell with each added variable.
No. The relationships all make sense for the most part.
Yes, I created a linear model for this dataset to predict Lender Yield.
The strengths of my model include:
1. The R-squared value is relatively high, at 0.585, which means it explains about 59% of the variability in this dataset.
2. I kept the number of predictor variables low, a total of 5. This should alleviate concerns about overfitting.
The limitations of my model include:
1. My model is sensitive to outliers. This is twofold: I did not take measures to clean the dataset of any outliers when calculating the linear model, therefore the model may be affected drastically if the outliers are significant. Also, since the model explains the variability of the data in a set range, if I were to try and predict the yield of something outside of that range, the extrapolation may not be accurate.
2. I did not explore any other modeling options, such as polynomial regression, which may create a better model and predict the yield better.
3. I did not transform any of the predictor variables into logs or square roots, which may create a better model and predict the yield better.
This multivariate plot summarizes the relationship between Lender Yield and Prosper Score for each Loan Status. Loans with the best status - Current or Paid - have a lower median Lender Yield than loans with ‘Past Due’ or ‘Defaulted’ statuses. This graph captures the conventional saying of “high risk, high reward”.
This stacked bar chart gives us a glimpse of why people go onto prosper.com and ask others for loans. The majority of loans taken out from Prosper are for “Debt Consolidation”. Within this category, almost all of the loans are taken out by people with income. It’s also interesting to see quite a few people with a $100k salary take out loans to consolidate debts.
The boxplot shows how attractive loans with high ratings are. Judging by the median for each rating, a higher rating usually means more investors funding the loan. The number of investors funding an ‘AA’ loan is drastically higher than for loans with any other rating, including ‘A’.
This was a large dataset with over 80 variables, so I felt going through the entire analysis process was quite an accomplishment. With barely any knowledge in the financial industry, I was surprised how far I could get with common sense, a large dose of curiosity, and R coding skills.
One of the challenges I faced while analyzing this dataset was understanding what each of the variables stood for. After looking through the variable dictionary a few times, I let my common sense kick in, picked a few variables that should have some obvious relationships, and went from there. Another challenge was choosing which type of visualization would be most useful for displaying the relationships among variables, which meant constantly going through Stack Overflow tips on using different ggplot2 functions.
As I made progress in the analysis, I quickly realized that there was a lack of a main feature of interest. Depending on who is in possession of this dataset, that person may see a feature of interest that is completely different from the next person. Could it be whether a borrower is delinquent? The amount that is delinquent? Predicting whether the status of a loan is good or not? Predicting the lender yield? Or is it to figure out how Prosper rates and scores each loan? By the time I moved onto bivariate analysis, my original feature of interest, “Loan Status”, kind of shifted to “LenderYield”, and I had to make adjustments to previous graphs, summaries, transformations, etc. In short, staying focused on a single feature of interest was very hard. However, once I made that shift, I explored potential relationships that are reasonable, and created a linear model that does a decent job predicting the Lender Yield.
One thing I realized later in the analysis was that I could perhaps transform or combine variables, and use other types of models, like logistic regression, to predict categorical variables such as Loan Status. Other data manipulation techniques, such as checking for and transforming outliers, may also help with the predictive power of my current linear model.
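As a rough sketch of what the logistic regression idea could look like, the example below fits a binomial `glm()` to simulated data. Everything here is hypothetical: the `bad_loan` indicator, the single `credit_score` predictor, and the simulated relationship are stand-ins, not results from the Prosper data.

```r
set.seed(1)

# Simulated stand-in data: higher credit scores lower the odds of a bad loan.
n <- 1000
credit_score <- round(runif(n, 500, 850))
p_bad <- plogis(8 - 0.014 * credit_score)
bad_loan <- rbinom(n, size = 1, prob = p_bad)

sim <- data.frame(bad_loan, credit_score)

# Logistic regression: models the log-odds of a loan going bad.
fit <- glm(bad_loan ~ credit_score, data = sim, family = binomial)

# Predicted probability of a bad loan at two example credit scores.
predict(fit, newdata = data.frame(credit_score = c(600, 750)),
        type = "response")
```

On the real data, the response would be a two-level version of Loan Status, and predictors such as Prosper Score or income range could be added in the same way they were added to the linear model.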